ABSTRACT


Knowledge of prerequisite dependencies is crucial to several as- 
pects of learning, from the organization of learning content to the 
selection of personalized remediation or enrichment for each learner. 
As the amount of content is scaled up, however, it becomes increas- 
ingly difficult to manually specify all of the prerequisites among the 
different content parts, necessitating automation. Since existing ap- 
proaches to automatically inferring prerequisite dependencies rely 
on analysis of content (e.g., topic modeling of text) or performance 
(e.g., quiz results tied to content) data, they are not feasible in cases 
where courses have no assessments or only short content pieces 
(e.g., short video segments). In this paper, we propose an algo- 
rithm that extracts prerequisite information using learner behav- 
ioral data instead of content and performance data, and apply it 
to an online short course. By modeling learner interaction with 
course content through a recurrent neural network-based architec- 
ture, our algorithm characterizes the prerequisite structure as latent 
variables, and estimates them from learner behavior. Through eval- 
uation on a dataset of roughly 12,000 learners in a course we hosted 
on our platform, we show that our algorithm excels at both predict- 
ing behavior and revealing fine-granular insights into prerequisite 
dependencies between content segments, with validation provided 
by a course administrator. Our approach of content analytics using 
large-scale behavioral data complements existing approaches that 
focus on course content and/or performance data. 


1. INTRODUCTION 


Recent advances in machine learning and big data have provided 
opportunities to revamp the traditional “one-size-fits-all” approach 
to education. Researchers have developed methods that analyze 
massive learner and content data to provide personalized recom- 
mendations on what actions learners should take, e.g., to read a 
section of a textbook, watch a lecture video, or work on a practice
question [19,24]. By catering to the needs of each individual
learner, such personalization methods can enhance learning
efficacy; see [1] for an overview.


By specifying an ordering of which learning content should be 
used before others, content prerequisite structures provide impor- 
tant guidance for the design of personalization algorithms. These 
structures may be defined at multiple levels of granularity, from 
across courses to within single pieces of learning content (e.g., 
between chunks of a video), or for specific units of knowledge 
(often termed “knowledge components”, “skills”, or “concepts”). 
Roughly speaking, learning content is deemed the prerequisite of 
another if it contains knowledge that learners have to master before 
studying the other. For example, Calculus is a prerequisite of Dif- 
ferential Equations at the granularity of different courses; learners 
should master the former before they learn the latter. 


Several works have demonstrated the utility of prerequisite struc- 
tures to learning and personalization. For one, [32] showed that 
when instructors do not take these prerequisite structures into ac- 
count when designing their course curriculums, learners do not 
perform as well. Also, [33] showed that learners with high mas- 
tery of prerequisite knowledge are much less likely to become con- 
fused in learning tasks, compared to those with low mastery. More- 
over, the works in [4, 37] showed that an important feature in the 
prediction of a learner’s first responses on a particular skill is the 
learner’s demonstrated mastery level on prerequisite skills. But ex- 
isting methods for extracting prerequisites suffer from important 
drawbacks that we will describe next. 


1.1 Existing Methods for Prerequisite Structure Extraction
Explicit prerequisite structures, like those in [32], are labor-intensive 
to construct manually and rarely available in practice, especially 
when considering fine-granular prerequisites (e.g., between file seg- 
ments). Inexplicit structures on the other hand, such as tables of 
contents in textbooks [18] and knowledge graphs constructed from 
large databases [3], typically only contain weak information about 
prerequisites: they offer some information on how learning con- 
tent should be ordered, but do not necessarily impact learner per- 
formance or behavior. This observation has motivated the devel- 
opment of automated methods for extracting explicit prerequisite 


Proceedings of the 11th International Conference on Educational Data Mining 66 


structures from data. Existing methods of automation can be di- 
vided into two main categories based on the type of data they use: 
(i) learner data and (ii) content data. 


Methods in the first category use one form of learner data al- 
most exclusively: learner performance, which usually consists of 
learners’ responses to assessment/quiz questions. These methods 
have used several different models/algorithms to make inferences 
from performance data, including causal graphs [28], structural 
expectation-maximization [9], Bayesian estimation [14], hypoth- 
esis testing [6], probabilistic association rules [10], convex opti- 
mization [27], correlation/regression analysis [7], and approximate 
Kalman filtering [21]. 


As for the second category, methods have leveraged several forms 
of content data and metadata. [18], for instance, proposed using 
the organization and unit titles in online textbooks to classify be- 
tween prerequisite and outcome concepts. Others have involved 
Wikipedia, either using the content on wiki pages to aid the extrac- 
tion of concept maps in textbooks [34,35] or extracting prerequisite 
structures among the pages themselves [22,31]. While [22] analyzed
the links between pages, [31] used both textual content and
the page creation and modification logs to extract prerequisites.


The major downside of these existing automation methods is that 
they require substantial learner performance or content data, which 
is not always available or accessible. Corporate training, for ex- 
ample, is a learning scenario in which many courses have few if 
any assessments; performance is in many cases assigned as a sin- 
gle satisfactory/unsatisfactory outcome at the end of the course [8]. 
Methods that extract prerequisite structures based on learner per- 
formance data, then, are not applicable in these settings. On the 
other hand, in many interactive learning environments like educa- 
tional games [23], content data is limited and not easily parsable; in 
these settings, methods to infer prerequisites based on content data 
(especially text) are not applicable. Moreover, in any learning sce- 
nario, as the level at which prerequisites are desired becomes more 
fine-grained, the amount of content data available in each content 
piece becomes smaller. 


As a result, there is a need to develop methods that can extract 
prerequisite structures from sources of data that (i) are abundant 
in different learning scenarios and (ii) can be captured within fine- 
granular pieces of content, especially in settings where content and 
performance data are limited. 


1.2 Our Method and Contributions


In this paper, we develop the first methodology to extract prereq- 
uisite structures from large-scale learner behavioral data, using a 
novel recurrent neural network (RNN)-based probabilistic model. 
Behavioral data measures learner interaction with course material, 
typically in the form of clickstream logs that are generated based on 
each mouse click; in this way, it can be captured on small pieces of 
content in any online learning scenario. We demonstrate the ability 
of our model to identify prerequisites between fine-granular content 
segments in the setting of online short-courses, where performance 
and content data are limited; for our particular dataset, the entire 
course is less than 15 minutes in duration, and while the 12,000 
learners do not respond to any assessment questions, they generate 
almost 900,000 clickstreams. 


Specifically, our methodology consists of three main steps: 


Feature engineering. First, we analyze the behavioral data cap- 
tured by our online learning platform in terms of a set of learning 
features (Section 2). These features summarize a learner’s behavior 
on each segment of content that they visit as one of four states: low 
or high engagement if they studied the segment, and skipping back 
or forward otherwise. In deriving the formulas to convert from data 
to features, we consider cases of off-task behavior (e.g., idle time) 
that should be filtered out. We also consider content features in our 
model; since the content data is sparse, we embed each segment 
according to pre-trained statistical language models. 


Modeling and inference. Second, we infer the parameters
of our probabilistic model through training and validation on the
dataset. The RNN-based learner model we propose (Section 3)
consists of two main parts: (i) a latent knowledge state transition
model, which considers how a learner’s knowledge state changes
based on the segment visited and behavior exhibited, and (ii) a
learner behavior model, which characterizes the probability that the
learner exhibits a particular behavior based on their current
knowledge gaps. Our model parameters are trained by minimizing
cross-entropy loss in the prediction of learner behavior on segments
they visit.


Prerequisite analysis. Third, we analyze prerequisite infor- 
mation for our dataset by examining a model parameter matrix 
that specifies dependencies between segments (20 second chunks 
of video in this course). To establish reliability, we start by evalu- 
ating the performance of our model in predicting behavior on our 
dataset (Section 4.2); in doing so, we find that it can obtain over 
85% accuracy and significant improvements over baselines. Then,
we visualize the prerequisite matrix, discuss the insights it provides,
and verify them through a questionnaire provided to a course
administrator (Section 4.3).


At the end, we also describe how our model parameters can drive 
content personalization. More generally, we believe that this work 
will motivate a new research thrust in using human behavior to 
aid content analytics: such approaches have the potential to ben- 
efit applications that involve large-scale human-content interaction 
but have only limited content data. 


2. BEHAVIORS AND CONTENT: DATA AND FEATURES


In this section, we detail our methods for processing learner behav- 
ioral data. We first discuss the specific course dataset we consider, 
then the data capture, and finally the computation of features from 
this data that are used in our prerequisite identification algorithm. 


2.1 Course and Enrollment 

The dataset we use comes from an online course on the topic of 
product development that we hosted on our course delivery plat- 
form. This course consists of 4 sequential videos that we divide 
into a total of 36 segments, with each segment spanning 20 seconds; 
totaling less than 15 minutes, this qualifies as a short-course [8]. 


We let $s = 1, 2, \ldots, S$ denote the index of the segments in the
course sequence (here, $S = 36$). Our evaluation will focus on the
roughly 12,000 learners who enrolled in this course over a six-month
period in 2017.
Figure 1: Visualization of the topic distributions across video
segments in the course, as inferred by LDA. We see that videos tend to
cover disparate sets of topics; therefore, this analysis does not help
us to extract prerequisite structures.


2.2 Data Capture 

We focus on two types of data captured by the platform: (i) video- 
watching clickstreams, which log each learner’s interactions with 
the video player, and (ii) transcripts of the course content, mea- 
sured in words. In total, this data consists of roughly 900,000 click- 
streams and 1,700 words across the video segments. 


Given such a limited text repository, relying on topic models alone 
to extract prerequisite structures is infeasible. Nonetheless, we in- 
corporate content data as one component of our methodology, since 
we seek to use any data sources available to aid the performance of 
our model. In later sections, we will experimentally validate the 
impact of this input on model performance, and the possibility of 
replacing it with other data. 


Video-watching clickstreams. The data capture architecture 
for our platform is event-driven, i.e., each event that a learner makes 
is recorded. The following is the space of actions available to a 
learner on the video scrub bar: Play (Pl), Pause (Pa), Skip forward
(Sf), and Skip backward (Sb). There are also actions available out- 
side of the scrubber: Enter video (En), Exit video (Ex), Window 
foreground (Wf), and Window background (Wx), where Wf and Wx 
dictate whether the course application is the current selection on 
the device. Formally, the ith event created by learner u in the course
will be in the format

$$E_u(i) = \langle v(i), a(i), s'(i), s(i), p(i), b(i) \rangle,$$

where $v(i)$ is the video ID and $a(i)$ is the type of action. $s(i)$ is the
segment of the video player immediately after $E_u(i)$ was fired, while
$s'(i)$ is the one immediately before. $p(i)$ is the UNIX timestamp (in
seconds) of this event, and $b(i) \in \{\text{playing}, \text{paused}\}$ is the
binary state of the video player immediately after event $i$ happens.


For a video with multiple segments, when the learner plays through 
the end of s, an event with a(i) = play, s’(i) =s, and s(i) =s+1 
will be generated. 
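
As an illustration of the event format, the record below is a minimal sketch of one clickstream event. The class name, field names, and two-letter action codes are our own hypothetical choices and not the platform's actual schema; the helper builds the event described above for a learner playing through the end of a segment.

```python
from dataclasses import dataclass

# Hypothetical two-letter codes for the eight actions listed in the text.
ACTIONS = {"Pl", "Pa", "Sf", "Sb", "En", "Ex", "Wf", "Wx"}


@dataclass(frozen=True)
class ClickEvent:
    video_id: str     # v(i): video ID
    action: str       # a(i): type of action
    seg_before: int   # s'(i): segment immediately before the event
    seg_after: int    # s(i): segment immediately after the event
    timestamp: float  # p(i): UNIX timestamp in seconds
    playing: bool     # b(i): True if the player is playing after the event

    def __post_init__(self):
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action code: {self.action}")


def play_through(video_id, segment, timestamp):
    """Event generated when playback runs past the end of `segment`."""
    return ClickEvent(video_id, "Pl", segment, segment + 1, timestamp, True)
```

For instance, `play_through("v1", 3, 100.0)` yields an event with `seg_before == 3` and `seg_after == 4`.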


Course content. The videos originate in .mp4 format for delivery
to learners. To obtain the text transcripts, we divide the videos into
20-second segments and employ open source speech-to-text conversion
software, creating one output for each segment and manually
correcting any transcription mistakes. Concretely, the output for
segment $s$ is the bag-of-words representation $x_s$ over a dictionary
$\mathcal{W} = \{w_1, w_2, \ldots\}$, where $x_s(k)$ is the number
of times word $w_k \in \mathcal{W}$ appears in $s$.
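
The bag-of-words construction can be sketched as follows. This is a minimal illustration assuming whitespace tokenization and lowercasing, neither of which is specified in the text; the function name and the toy transcripts are hypothetical.

```python
from collections import Counter


def bag_of_words(transcripts):
    """Return (dictionary, counts), where counts[s][k] is the number of
    times dictionary[k] appears in the transcript of segment s."""
    tokenized = [text.lower().split() for text in transcripts]
    dictionary = sorted({word for tokens in tokenized for word in tokens})
    index = {word: k for k, word in enumerate(dictionary)}
    counts = []
    for tokens in tokenized:
        x = [0] * len(dictionary)
        for word, n in Counter(tokens).items():
            x[index[word]] = n
        counts.append(x)
    return dictionary, counts
```

Each row of `counts` plays the role of one vector $x_s$, indexed in the same order as the dictionary.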


To further motivate our behavior-based approach to inferring pre- 
requisites, in Figure 1 we show the progression of topics through 
the segments in the course as inferred by the latent Dirichlet alloca- 
tion (LDA) topic analysis algorithm [2]. LDA extracts document- 
topic and topic-word distributions from a corpus of text separated 
into documents; here, segments are treated as separate documents, 
and the segment-topic distributions are plotted. According to this 
model, each video focuses on fairly independent topics, with min- 
imal overlap (e.g., the segments in the first video focus heavily 
on topic 3, while those in the third focus almost entirely on topic 
5). This analysis shows how topic analysis alone provides lim- 
ited insights into prerequisite structures which likely extend across 
videos, a point we will verify later in our model evaluation. 


2.3 Feature Construction 
We construct two types of features from our data: (i) video-watching 
behaviors and (ii) text embedding vectors. The behaviors are learner- 
specific, while the text vectors are not. 


Video-watching behaviors. Let $s(u,t)$ denote the segment
learner $u$ visited at time index $t \in \{1, \ldots, T_u\}$, with $T_u$ being
the total number of (not necessarily unique) segments $u$ visited. The
time instance here increments whenever the learner transitions to a
different segment, i.e., $s(u,t) \neq s(u,t+1)$. In our model, we
consider the behavior of learner $u$ at time $t$ as a feature
$f_{u,t} \in \mathcal{F}$, where $\mathcal{F} = \{\text{LE}, \text{HE}, \text{SB}, \text{SF}\}$
is a set of four states summarizing behavior on a segment: Low
Engagement (LE), High Engagement (HE), Skip Back (SB), and Skip
Forward (SF).
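
To make the time index concrete, the sketch below collapses a per-event log of segments into the visit sequence $s(u,1), \ldots, s(u,T_u)$. It assumes the input is the list of post-event segments $s(i)$ for one learner in chronological order; the function name is hypothetical.

```python
from itertools import groupby


def visit_sequence(segments_after_events):
    """Collapse consecutive repeats so that each entry is one visit;
    the time index advances only when the segment changes."""
    return [segment for segment, _ in groupby(segments_after_events)]
```

A segment that is revisited later appears again in the output, which is why the $T_u$ visits are not necessarily unique.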


$f_{u,t}$ is determined by analyzing the set of measurements $E_{u,t}$ that
occur for learner $u$ during time $t$. Letting $i(t)$ and $i(t+1)$ be the
indices of the events where $u$ transitions to $s(u,t)$¹ and $s(u,t+1)$,
respectively, then $E_{u,t} = \{E_u(i) : i(t) \leq i < i(t+1)\}$. From this, we
first calculate the time spent on $s(u,t)$ by aggregating the changes in
timestamps between sequential events in $E_{u,t}$, excluding any periods
with the app in the background that indicate learner off-task behavior:


$$m_{s(u,t)} = \sum_{\substack{i,\, i+1 \in E_{u,t} \\ a(i) \neq \text{Wx}}} \min\{p(i+1) - p(i),\, T_{\max}\},$$

where $T_{\max} = 300$ sec is an upper bound for idle time on each 20
second segment.
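
The time-spent calculation can be sketched as below. This is an illustration under our own assumptions: events are given as (timestamp, action) pairs in chronological order, the code "Wx" marks the window moving to the background, and the 300-second bound is applied to each gap between consecutive events. The function and parameter names are hypothetical.

```python
def time_spent(events, idle_cap=300.0, background="Wx"):
    """Sum the gaps between consecutive events of one visit.

    Gaps that start with a window-background event are dropped (off-task),
    and every remaining gap is capped at `idle_cap` seconds.
    """
    total = 0.0
    for (p_cur, a_cur), (p_next, _) in zip(events, events[1:]):
        if a_cur == background:
            continue  # the app was in the background during this gap
        total += min(p_next - p_cur, idle_cap)
    return total
```

With events at 0 (Pl), 5 (Wx), 65 (Wf), 70 (Pa), and 500 (Pl) seconds, the gaps counted are 5, 5, and a capped 300, for a total of 310 seconds.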


If $m_{s(u,t)} < 3$, then we infer that the learner has skipped over $s(u,t)$;
in this case, if $s(u,t+1) > s(u,t)$, then it is a forward skip and
$f_{u,t} = \text{SF}$, whereas if $s(u,t+1) < s(u,t)$ then it is backwards and
$f_{u,t} = \text{SB}$. On the other hand, if $m_{s(u,t)} \geq 3$, then the learner has
engaged with the segment; similar to [8], we quantify engagement
on $s(u,t)$ as

$$e_{s(u,t)} = \left( \frac{1 + m_{s(u,t)} / \bar{m}_s}{2} \right)^{\alpha},$$

where $\bar{m}_s$ is the expected time spent on $s$, and $\alpha \in (0, 1]$ is a
parameter for the diminishing marginal returns of time spent on
engagement.² Intuitively, $m_{s(u,t)} > \bar{m}_s$ gives $e_{s(u,t)} > 1$. With this,
and $m_{s(u,t)} \geq 3$, we specify: if $e_{s(u,t)} < 1$, then engagement is low and
$f_{u,t} = \text{LE}$, whereas if $e_{s(u,t)} \geq 1$ it is high and $f_{u,t} = \text{HE}$.

¹In other words, $i(t) = i : s'(i) \neq s(u,t),\ s(i) = s(u,t)$.

Figure 2: Roll-out visualization of the architecture of our RNN-based
learner behavior model. At time $t+1$, we use the observed learner
behavior $f_{u,t}$ at time $t$, the learner’s prior knowledge state $h_{u,t}$, and
the knowledge contained in the previous segment $s(u,t)$ to update their
current knowledge state. Then, we calculate the prerequisite knowledge
gap and the learning goal knowledge gap using the prerequisite
structure $R$ among segments, which decide the learner’s behavior at
time $t+1$. Latent variable dependencies are denoted by solid arrows,
while the prerequisite dependencies are denoted by dashed arrows.
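
The mapping from time spent to one of the four behavior states can be sketched as follows. The function and parameter names are hypothetical; the defaults (an expected time of 20 seconds, an exponent of 0.1, and a 3-second skip threshold) are the values quoted in the text.

```python
def behavior_state(time_spent, segment, next_segment,
                   expected=20.0, alpha=0.1, skip_threshold=3.0):
    """Classify one visit as SF, SB, LE, or HE."""
    if time_spent < skip_threshold:
        # Too short to count as studying: the direction decides the skip type.
        return "SF" if next_segment > segment else "SB"
    engagement = ((1.0 + time_spent / expected) / 2.0) ** alpha
    return "HE" if engagement >= 1.0 else "LE"
```

Because the engagement value reaches 1 exactly when the time spent equals the expected time, the HE/LE boundary under these defaults is simply whether the learner spent at least 20 seconds on the segment.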


Course content embeddings. We now detail our approach to 
processing course content data into features. As discussed, due to 
the limited textual information in this application, applying stan- 
dard natural language processing methods (such as word count 
techniques [17] and LDA) may not be sufficient. Instead, we re- 
sort to statistical language models that are pre-trained on web-scale 
data; in particular, we use GloVe embeddings [26], a word-to- 
vector mapping pre-trained on the Wikipedia 2014 and Gigaword 
5 datasets. These embeddings are well suited as inputs to RNNs,
since the Euclidean distance (or cosine similarity) between GloVe
vectors provides useful insights into the linguistic similarities
between the corresponding words [26].


Specifically, we seek a vector representation $y_s$ for segment $s$ that
quantifies the material covered in $s$ based on the bag-of-words
$x_s$. We first map each word $w_k \in \mathcal{W}$ to its corresponding vector
$\bar{y}_k \in \mathbb{R}^{100}$ in the pre-trained GloVe library,³ where 100 is the
choice of dimension in the pre-trained embedding. We then aggregate the
word vectors in $s$ to obtain the embedding
$\tilde{y}_s = \sum_k x_s(k) \cdot \bar{y}_k \in \mathbb{R}^{100}$.
To reduce the number of parameters under consideration, we further
perform dimensionality reduction of $\tilde{y}_s$ via principal component
analysis (PCA) [15], obtaining $y_s \in \mathbb{R}^D$ for a parameter $D$; $y_s$
is taken as the top-$D$ principal components of the PCA. We will
consider the choice of $D$ in our experiments section.
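
The aggregation step (before PCA) amounts to a count-weighted sum of word vectors, sketched below. The function name is hypothetical, the toy vectors are 2-dimensional rather than the 100-dimensional GloVe vectors, and the PCA step is omitted.

```python
def segment_embedding(counts, word_vectors):
    """Count-weighted sum of word vectors for one segment.

    counts[k] is the number of times word k appears in the segment, and
    word_vectors[k] is that word's pre-trained embedding.
    """
    dim = len(word_vectors[0])
    total = [0.0] * dim
    for n, vector in zip(counts, word_vectors):
        for j in range(dim):
            total[j] += n * vector[j]
    return total
```

Words that do not appear in a segment have a count of zero and therefore contribute nothing to its embedding.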


3. RNN-BASED MODEL 


We now propose an RNN-based probabilistic model for learner
behavior that uses the features defined in Section 2. The reason that
we choose RNN as a basis is that it is often used to model sequential
data, such as text [13, 16] and user purchasing activities [25],
which is characteristic of learner behavioral sequences as well.

²For the 20 sec video segments in this course, we set $\bar{m}_s = 20$ and
$\alpha = 0.1$ by default.

³https://nlp.stanford.edu/projects/glove/


Our overall model architecture is visualized in Figure 2. It consists 
of two main parts described in this section: (i) a latent knowledge 
state transition model, and (ii) a learner behavior model. 


3.1 Latent Knowledge State Transition Model 
The state transition model is similar to that of generic RNNs.
In our context, the transition is induced by gaining knowledge
from watching a video segment. Letting $h_{u,t} \in \mathbb{R}^K$ denote the
K-dimensional knowledge state vector of learner $u$ at time $t$, we model
the transition as

$$h_{u,t} = \sigma\left(W h_{u,t-1} + l_{u,t-1} + b\right), \qquad (1)$$

where $W \in \mathbb{R}^{K \times K}$ denotes the state transition parameter matrix and
$b \in \mathbb{R}^K$ denotes the bias vector. $\sigma(\cdot)$ is a nonlinear function, for
which we will test a range of possible nonlinearities later in the
experiments section. $l_{u,t-1}$ is defined as

$$l_{u,t-1} = e_{u,t-1} U y_{s(u,t-1)}$$


to quantify the amount of knowledge the learner acquires from
watching segment $s(u,t-1)$ (a setting that follows [21]); $U y_{s(u,t-1)}$
captures the knowledge contained in this segment, since $y_{s(u,t-1)}$ is
its GloVe embedding and $U \in \mathbb{R}^{K \times D}$ is the input parameter matrix
that maps its text embedding to latent knowledge, while $e_{u,t-1}$ is the
scalar engagement variable which dictates the amount of $U y_{s(u,t-1)}$
transferred to the learner. We parameterize $e_{u,t}$ with the behavioral
feature $f_{u,t}$:

$$e_{u,t} = \begin{cases} e_h & \text{if } f_{u,t} = \text{HE} \\ e_l & \text{if } f_{u,t} = \text{LE} \\ 0 & \text{if } f_{u,t} \in \{\text{SB}, \text{SF}\}. \end{cases}$$

Here, $e_h, e_l \in [0, 1]$ are parameters that characterize the specific
engagement levels that HE and LE correspond to in our model. If a
learner skips a video segment, their engagement level is zero, so no
knowledge is gained.
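
One step of the knowledge state transition can be sketched in plain Python as below. This is an illustration only, not the authors' TensorFlow implementation: the function names and the default engagement levels (0.9 and 0.3) are arbitrary placeholders for parameters that the model learns, and tanh stands in for the nonlinearity.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def engagement_weight(behavior, e_high, e_low):
    """Scalar engagement: e_high for HE, e_low for LE, zero for skips."""
    if behavior == "HE":
        return e_high
    if behavior == "LE":
        return e_low
    return 0.0


def transition(h_prev, y_segment, behavior, W, U, b,
               e_high=0.9, e_low=0.3, sigma=math.tanh):
    """New state = sigma(W h_prev + l + b), where l = e * (U y_segment)."""
    e = engagement_weight(behavior, e_high, e_low)
    gained = [e * x for x in matvec(U, y_segment)]
    carried = matvec(W, h_prev)
    return [sigma(c + g + bias) for c, g, bias in zip(carried, gained, b)]
```

When the behavior is a skip, the gained term is zero and the state evolves through `W` and `b` alone.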
Note that this characterization of engagement differs from that
described in [8,20]. In our model, when there is no knowledge input
($l_{u,t-1} = 0$), $W$ and $b$ can be used to characterize other causes of
knowledge state transition, e.g., forgetting. For another example on
the relationship between engagement and learning, see [29].


3.2 Learner Behavior Model

The behavior model concerns the feature variable $f_{u,t}$. We model
the probability that a learner selects each $f \in \mathcal{F}$ with the following
softmax distribution:

$$P(f_{u,t} = f) = \frac{\exp\left(v_f^T [g_{u,t}^T, z_{u,t}^T]^T + d_f\right)}{\sum_{f' \in \mathcal{F}} \exp\left(v_{f'}^T [g_{u,t}^T, z_{u,t}^T]^T + d_{f'}\right)}, \qquad (2)$$

where the variables are $g_{u,t} \in \mathbb{R}^K$, $z_{u,t} \in \mathbb{R}^K$, $v_f \in \mathbb{R}^{2K}$, and
$d_f \in \mathbb{R}$. The vectors $v_f$ and the biases $d_f$, together with latent state
variables $g_{u,t}$ and $z_{u,t}$, decide learner behaviors on each video
segment. $g_{u,t}$ denotes the prerequisite knowledge gap and $z_{u,t}$ denotes
the learning goal knowledge gap; they are defined from the knowledge
state transition model as follows:


Prerequisite knowledge gap: $g_{u,t} := p_{s(u,t)} - r_{u,t}$ is the prerequisite
knowledge gap vector. $p_s$ denotes the required knowledge level of
segment $s$, and $r_{u,t}$ denotes the portion of learner $u$’s knowledge
state at time $t$ that is relevant to the prerequisite requirement of
segment $s(u,t)$. Concretely, $r_{u,t}$ is defined as

$$r_{u,t} = \sum_{\tau=1}^{t-1} R_{s(u,\tau),\, s(u,t)} \, l_{u,\tau},$$

where the matrix $R \in \{\mathbb{R}_+ \cup 0\}^{S \times S}$, at the core of our model,
characterizes the prerequisite structure among segments. A large
value of $R_{s,s'}$ implies segment $s$ is a strong prerequisite of $s'$, while
$R_{s,s'} = 0$ means $s$ is not a prerequisite of $s'$. Note that the
nonnegativity constraint placed on the prerequisite structure matrix is
necessary for interpretability of the model parameters, since reversing
the sign of every parameter would lead to the same data likelihood,
rendering the model unidentifiable in the absence of this constraint.
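
The prerequisite knowledge gap can be sketched as follows: the knowledge gained on each earlier visit is weighted by how strongly that segment is a prerequisite of the current one, and the result is subtracted from the required level. The function and argument names are hypothetical, and segments are 0-indexed here for convenience.

```python
def prerequisite_gap(visited, gained, R, required):
    """Return the gap p_s - r for the current (last) visited segment.

    visited:  segments visited so far; the last entry is the current one.
    gained:   knowledge vector acquired on each earlier visit.
    R:        R[s][s2] >= 0 is how strongly s is a prerequisite of s2.
    required: required[s] is the required knowledge vector of segment s.
    """
    current = visited[-1]
    size = len(required[current])
    relevant = [0.0] * size
    for segment, knowledge in zip(visited[:-1], gained):
        weight = R[segment][current]
        for k in range(size):
            relevant[k] += weight * knowledge[k]
    return [p - r for p, r in zip(required[current], relevant)]
```

A zero entry in `R` means that knowledge gained on that earlier segment does nothing to close the gap for the current one.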


Learning goal knowledge gap: $z_{u,t} := c_u - h_{u,t-1}$ denotes the
learning goal knowledge gap vector. $c_u$ characterizes the learning goal
of learner $u$, i.e., a target knowledge state that they are satisfied
upon reaching, while $h_{u,t-1}$ denotes their previous knowledge state.
In general, $c_u$ can either be personally imposed (e.g., in optional,
recreational learning) or externally enforced (e.g., in
institutionalized learning); for the course in this paper, it is the latter.
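
Given the two gap vectors, the softmax behavior distribution can be sketched as below. The function name and the dictionary-based parameter layout are our own assumptions; subtracting the maximum score before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import math

BEHAVIORS = ("LE", "HE", "SB", "SF")


def behavior_probs(g, z, v, d):
    """Softmax over behaviors from the concatenated gap vectors [g, z].

    v[f] is the weight vector (length 2K) and d[f] the bias of behavior f.
    """
    x = list(g) + list(z)
    scores = {f: sum(a * c for a, c in zip(v[f], x)) + d[f] for f in BEHAVIORS}
    top = max(scores.values())
    exps = {f: math.exp(s - top) for f, s in scores.items()}
    total = sum(exps.values())
    return {f: e / total for f, e in exps.items()}
```

With all weights and biases at zero, every behavior receives probability 0.25, and raising one bias shifts mass toward that behavior.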


Model intuition. Our model is based on the intuition that there 
are two factors driving a learner’s behavior while watching a par- 
ticular video segment. The setup of these two factors enables us 
to extract the prerequisite dependencies (R) among video segments 
by observing the sequences of learner behaviors. 


The first factor, parameterized by the prerequisite knowledge gap
vector $g_{u,t}$, characterizes whether the learner possesses enough
prerequisite knowledge to master the current segment. This gap
is given by the difference between the knowledge level required
to master the current segment ($p_{s(u,t)}$) and the learner’s
accumulated knowledge from prerequisite segments ($r_{u,t}$). The learner
would have gained such knowledge by exhibiting high engagement
($f_{u,t} = \text{HE}$) on the prerequisite segments; if they do not have
enough, they are more likely to skip backwards ($f_{u,t} = \text{SB}$) to study
further.


The second factor, parameterized by the learning goal knowledge
gap vector $z_{u,t}$, characterizes whether the learner has already
reached their learning goal. This gap is given by the difference
between the goal ($c_u$) and the learner’s previous knowledge state
($h_{u,t-1}$). If the learner has already accumulated enough knowledge,
they are more likely to exhibit low engagement ($f_{u,t} = \text{LE}$) or
to skip forward ($f_{u,t} = \text{SF}$).


Parameter inference. We estimate the latent model parameters,
i.e., the input, transition, and output parameters $U$, $W$, and
$v_f$, the biases $b$, the latent engagement level parameters $e_h$ and $e_l$,
the learning goal vectors $c_u$, and the prerequisite structure matrix $R$
by using the Adagrad optimizer [11] to minimize the cross-entropy
loss [12] on the observed behavior sequences. The cross-entropy
loss is the standard loss function for categorical data (each category
corresponds to a behavior in $\mathcal{F} = \{\text{LE}, \text{HE}, \text{SB}, \text{SF}\}$). We implement
our inference algorithm in TensorFlow.⁴
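
For reference, the objective can be written out as below. This is a sketch in the notation of this section, where $\hat{f}_{u,t}$ is our own shorthand for the behavior actually observed for learner $u$ at time $t$; whether the sum is averaged or otherwise normalized is not specified in the text.

```latex
\mathcal{L} = - \sum_{u} \sum_{t} \log P\left(f_{u,t} = \hat{f}_{u,t}\right)
```

Here $P(\cdot)$ is the softmax distribution of the behavior model, so minimizing $\mathcal{L}$ raises the probability assigned to the behaviors learners actually exhibited.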


4. EXPERIMENTS 


In this section, we evaluate our model proposed in Section 3 on the 
product development course. We first describe our experimental 
setup, including training/validation and tuning procedures. Then, 
we investigate the ability of our model to predict learner behavior 
on future video segments, compared to baselines. Once we have 
established model quality, we perform an exploratory analysis of 
the prerequisite structure information in the model, and present the 
results from sharing these insights with a course administrator. 


4.1 Experimental Setup 


Training and validation. We partition the original dataset into
two parts: (i) the training set, which is used to train models, and (ii)
the validation set, which is used to evaluate prediction performance.
We randomly select 90% of the learners to form the training set and
use the remaining 10% as the validation set.


In each training epoch, we randomly select 800 learners from the 
training set and use their behavioral data to calculate the gradient 
of the overall cross-entropy loss with respect to our model parame- 
ters. We then take a gradient step using the Adagrad optimizer [11] 
and evaluate the prediction performance of our model on the
validation set. Note that since the learners in the validation set are not
used in our training procedure, we do not have estimates of their
target knowledge state vector $c_u$. Therefore, we take the average
of the estimated target knowledge state vectors over learners in the
training set and use it for learners in the validation set.
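
The learner-level split and the averaged goal vector for held-out learners can be sketched as follows. The function names and the fixed seed are hypothetical; the 90/10 fraction is the one stated above.

```python
import random


def split_learners(learner_ids, train_fraction=0.9, seed=0):
    """Randomly split learners (not individual events) into two sets."""
    ids = list(learner_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(train_fraction * len(ids)))
    return ids[:cut], ids[cut:]


def mean_goal(goals):
    """Average the estimated goal vectors of the training learners.

    goals maps each training learner to their estimated goal vector; the
    result is used as the goal vector for every validation learner.
    """
    vectors = list(goals.values())
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

Splitting by learner rather than by event keeps every behavior sequence intact within a single set.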


Metrics. We report the performance of our proposed model and 
baselines using two standard evaluation metrics on the validation 
dataset: (i) the cross entropy loss, and (ii) prediction accuracy, 
which is simply the percent of behaviors that are predicted
correctly. Lower loss and higher accuracy imply better performance.
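
Both metrics can be computed from per-step predicted distributions, as in the sketch below. The function name is hypothetical, and averaging the loss over time steps is our assumption rather than a convention stated in the text.

```python
import math


def evaluate(predictions, observed):
    """Return (mean cross-entropy loss, accuracy).

    predictions: one dict per time step mapping behavior -> probability.
    observed:    the behavior actually exhibited at each time step.
    """
    loss = 0.0
    correct = 0
    for probs, actual in zip(predictions, observed):
        loss -= math.log(probs[actual])
        if max(probs, key=probs.get) == actual:
            correct += 1
    n = len(observed)
    return loss / n, correct / n
```

A prediction counts as correct when the observed behavior is the one with the highest predicted probability.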


Baselines. We focus on shallow RNN-type networks as baselines,
since (i) they have been widely used to model sequential data
and (ii) they have a similar architecture to our model, thereby
providing a fair comparison.

⁴https://www.tensorflow.org/


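First, we consider an RNN model with content GloVe embeddings
$y_s$ as input and learner behaviors $f_{u,t}$ as output, which we refer to
as RNN-G:

$$h_{u,t} = \sigma\left(U y_{s(u,t)} + W h_{u,t-1} + b\right),$$

$$P(f_{u,t} = f) = \frac{\exp\left(v_f^T h_{u,t} + b_f\right)}{\sum_{f' \in \mathcal{F}} \exp\left(v_{f'}^T h_{u,t} + b_{f'}\right)}.$$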


In RNN-G, the input at every time step does not contain the
learner’s actual behavior in the last time step. Such a setting can
be disadvantageous when the input provides only limited information
on the current output. To investigate this, we also consider an
RNN model that feeds the ground truth behavior from the last time
step ($f_{u,t-1}$) back into the model as input at the current time step,
which we refer to as RNN-F:

$$h_{u,t} = \sigma\left(U f_{u,t-1} + W h_{u,t-1} + b\right),$$

$$P(f_{u,t} = f) = \frac{\exp\left(v_f^T h_{u,t} + b_f\right)}{\sum_{f' \in \mathcal{F}} \exp\left(v_{f'}^T h_{u,t} + b_{f'}\right)}.$$


Here, we slightly abuse notation, using $f_{u,t-1} \in \{0, 1\}^{|\mathcal{F}|}$ to denote
the one-hot-encoded vector version of the observed learner action
at time $t - 1$ [12]. Note that this network structure has been used
to model sequential data, e.g., text; this technique is sometimes
referred to as teacher forcing [36].
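
The one-hot encoding and the teacher-forced input sequence can be sketched as follows. The ordering of the four behaviors and the function names are our own assumptions.

```python
BEHAVIORS = ("LE", "HE", "SB", "SF")


def one_hot(behavior):
    """Encode a behavior as a 0/1 vector with one entry per behavior."""
    return [1 if b == behavior else 0 for b in BEHAVIORS]


def teacher_forced_inputs(observed):
    """Inputs for time steps 2..T: the observed behavior one step earlier."""
    return [one_hot(b) for b in observed[:-1]]
```

For a sequence of three observed behaviors, this yields two input vectors, one for each step that has a predecessor.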


These two baselines—RNN-G and RNN-F—can both use informa- 
tion from previous time steps for the prediction of learner behavior 
at the current time step. In some sequential prediction tasks, only 
recent information is needed, whereas in other scenarios, long-term 
dependencies must be considered; the latter may especially be true 
in learning given how material builds on itself [38]. Since neither
RNN-G nor RNN-F is well suited to using information from many
time steps back, we will also consider the long short-term memory
(LSTM) network as a baseline algorithm, which we will refer to as
LSTM. Similar to RNN-F, we use previous learner behavior as the 
input to the next time step in LSTM. The comparison between our 
model and LSTM will show which is better at storing and retrieving 
information from further back in time. 


Parameter tuning. Several parameters must be tuned to optimize
the performance of each model. First is the dimension of
the latent knowledge state vector $K$, which applies to all models:
we sweep over $K \in \{5, 10, \ldots, 55\}$. Second is the dimension of
the GloVe embedding $D$, for our model and RNN-G: we consider
$D \in \{5, 10, \ldots, 45\}$, where $D$ corresponds to the top-$D$ principal
components of the PCA on the segment vectors.


We also examine the performance of our model with different
choices of the nonlinearity function $\sigma(\cdot)$. For this, we use the
nonlinearities built into TensorFlow: rectified linear units (relu),
exponential linear units (elu), hyperbolic tangent (tanh), soft plus
(softplus), and no nonlinearity (identity).


Through our experiments, we found that a constant learning rate of
0.01 and a total of 300-500 training epochs consistently led to the
best results, for all three baseline algorithms. As a result, we will
not perform more than 350 training epochs, since the performance
does not significantly improve after that.

Figure 3: Performance of our model against the number of training
epochs. While the training loss continues to decrease, the validation
loss stabilizes quickly after approximately 200 epochs.


4.2 Prediction Performance 

We consider model performance against several parameters. When
parameters are constant, they take the default values of $K = 45$,
$D = 30$, and $\sigma = \tanh$.


Varying number of training epochs. In Figure 3, we plot 
the cross entropy loss on both the training and validation sets, as 
well as the accuracy on the validation set, as the number of training 
epochs is varied for our model. We see that (i) the training loss 
exhibits a continually decreasing trend with minimal fluctuations, 
while (ii) the validation loss drops quickly initially but stabilizes 
after around 200 epochs, and (iii) the validation accuracy stabilizes 
quickly after about 20 epochs. Since the performance on the vali- 
dation set remains stable after a large number of epochs, we con- 
clude that our model does not easily overfit. In fact, implementing 
dropout regularization [30] showed minimal impact on the perfor- 
mance of our model. Therefore, we did not use dropout or any 
other form of regularization in our other experiments. 
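

One simple way to make "the validation loss stabilizes" operational is a patience-based plateau check. The function below is an illustrative sketch and not necessarily the procedure used in our experiments; the name plateau_epoch and the tol and patience values are hypothetical.

```python
def plateau_epoch(val_losses, tol=1e-3, patience=20):
    """Return the epoch of the last meaningful improvement in validation loss.

    An improvement counts only if it beats the best loss so far by more than
    `tol`. Once `patience` epochs pass without such an improvement, the epoch
    of the best loss is returned. Returns None if no plateau is detected.
    """
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - tol:
            best = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None


# Toy loss curve (made-up numbers): a fast drop followed by a flat tail.
toy_losses = [1.0, 0.8, 0.6, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
stable_from = plateau_epoch(toy_losses, tol=0.01, patience=3)
```

Applied to a logged validation curve, such a check gives a reproducible epoch count to report instead of a visual estimate.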


Varying latent knowledge state dimension K. In Figure 4, 
we plot (a) the cross entropy loss and (b) the accuracy of all four 
models on the validation set against the dimension of the hidden 
layer K. Overall, we see that our model outperforms every base- 
line for each choice of K, and significantly so on the cross entropy 
loss metric, which demonstrates the ability of our model to accu- 
rately predict learner behavior. While all models show improving 
performance as K increases, after K = 10 the improvement for our 
model is minimal. The fact that our model uses the same input
information yet outperforms RNN-F justifies our particular design
choices involving the prerequisite knowledge gap and learning goal
knowledge gap vectors.


We also see that RNN-G performs significantly worse than RNN-F.
This observation suggests that the features given by the content
data provide only limited information on learner behavior, which
validates our conjecture that the learning content itself (i.e., the
video transcripts) is a very limited data source. Finally, we note
that among the baselines, LSTM slightly outperforms RNN-F, indicating
that in our application of online learning, there is benefit
to preserving information on behavior further back in time.


Figure 4: Prediction performance on the validation set as the dimension of the latent knowledge state vector (K) is varied. Our model
outperforms all baselines in each case tested, especially on the cross entropy loss metric, indicating an overall ability to predict learner
behavior. Moreover, the performance of our model is robust to the choice of K.


Varying input dimension D. In Figure 5, we plot (a) the cross 
entropy loss and (b) the accuracy of our proposed model and RNN- 
G against the dimension of the input GloVe embedding D on the 
validation set. Overall, we see that the performance of both models 
is insensitive to the choice of D. One possible explanation is that 
even with very low-dimensional input (i.e., taking only the top few 
principal components), the embeddings still encapsulate the video 
transcript text effectively. To investigate this, in Figure 5, we label 
the percentage of variance explained by the top-D principal com- 
ponents of the GloVe embedding for every value of D. We see that 
the top-5 principal components (i.e., D = 5) explain about 95% of 
the total variance, which explains why increasing D beyond D = 5 
does not further improve the performance. This observation on the 
percentage of variance explained provides more evidence that the 
information contained in the textual content is limited. 
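

The percentage of variance explained by the top-D principal components can be computed directly from the singular values of the centered segment vectors. The sketch below uses NumPy on synthetic data as a stand-in for the real embeddings: the 35 rows mirror Segments 0 to 34, while the 50 columns and the decaying scales are arbitrary assumptions, and the function names are hypothetical.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component.

    X has one row per segment and one column per embedding dimension.
    """
    Xc = X - X.mean(axis=0)                   # center each column
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    var = s ** 2
    return var / var.sum()


def project_top_d(X, D):
    """Project the centered rows of X onto the top-D principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:D].T


# Synthetic stand-in for the segment vectors (NOT the course data).
rng = np.random.default_rng(0)
scales = 0.5 ** np.arange(50)                 # rapidly decaying column scales
X = rng.normal(size=(35, 50)) * scales

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)                 # cumulative[D - 1] is the top-D share
X_top5 = project_top_d(X, 5)                  # the kind of input used when D = 5
```

Here cumulative[D - 1] is the quantity labeled at each point of Figure 5, and project_top_d produces the D-dimensional input for a given choice of D.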


Varying nonlinearity σ. In Table 1, we tabulate the cross entropy
loss and accuracy of our model on the validation set using
the different nonlinearity functions σ. Overall, while the elu
nonlinearity achieves the best performance when considering both metrics,
every choice of nonlinearity leads to very similar performance.
This suggests that our model is robust to the choice of nonlinearity
in the latent knowledge state transition.


4.3 Prerequisite Structure Analysis 

Having established overall model quality, we now analyze the ex- 
tracted prerequisite structure, i.e., the model matrix R. In doing so, 
we will consider several examples that illustrate how the course was 
constructed, referring to the video titles and segment transcripts as 
needed. We then validate the insights through the results of a ques- 
tionnaire on some of the particular findings that was provided to a 
course administrator. This administrator possesses intimate knowl- 
edge of the course content and how it was constructed. 


To derive the insights, we consider two different cases of the matrix:
(a) R across the entire course, obtained from extracting the
prerequisite structure between all video segments, and (b) R^v for
each video v, from estimating the structure between segments in
each video separately. Case (a) uses the results for K = 45, D = 30,
σ = tanh from the previous experiment, while case (b) is a new
experiment with these settings.


4.3.1 Insights: Full course matrix 

Figure 6(a) visualizes R across the course. We focus on a few 
key findings here, some across videos and some for individual seg- 
ments. First is that segments in the last two videos have substan- 
tially more prerequisites than those in the first two. The only seg- 
ment with significant prerequisites in the first two is Segment 8, 
while the only one without significant prerequisites in the second 
two is Segment 13. At a high level, then, we can infer that the first 
two videos are laying the groundwork for material covered later on. 
This makes sense considering even just the titles of the videos, with 
the first two geared towards explaining the “vision” and reasoning 
for the development of this product, and the later two expounding 
on the product’s “features” and technical description.> 


For individual segments, consider Segment 8 from the previous dis- 
cussion. This segment has all previous ones as prerequisites, with 
some more significant than others. The transcript for this segment 
indicates a discussion on the demand for this type of product over 
the next several years, which is traditionally viewed as “problem- 
atic,” so it makes sense that learners should study Segments 0 to 7 
first to understand the “vision” of this version of the product to mit- 
igate the problem. Segments 1, 4, and 6 discuss “problem mitiga- 
tion” in particular, consistent with them being larger prerequisites. 
Another interesting case is Segment 26, for which there are several
prerequisite segments throughout the course, but the one immediately
previous is not as significant. Segment 26 actually continues
with the theme of “problem mitigation,” which is discussed in
Segment 24 but not in Segment 25. Segments 4, 11, and 19 reference
the particular method of “problem mitigation,” which is also consistent
with them being strong prerequisites to Segment 26.


>We omit exact video titles and transcripts in this section to preserve
anonymity, but provide enough context for the key points.


Figure 5: Prediction performance on the validation set as the dimension of the input word embedding (D) is varied for both our model and
RNN-G. For each point, we label the percentage of variance in the input explained by the top-D principal components. The performance
remains largely unchanged as D increases in each case, which is consistent with over 98% of the variance being explained by the top-5
principal components (i.e., D = 5).


Activation function | Formula                                     | Accuracy | Cross entropy loss
relu                | σ(x) = x if x > 0, σ(x) = 0 if x ≤ 0        | 0.861    | 0.444
tanh                | σ(x) = (e^x − e^(−x)) / (e^x + e^(−x))      | 0.861    | 0.445
elu                 | σ(x) = x if x > 0, σ(x) = e^x − 1 if x ≤ 0  | 0.861    | 0.443
softplus            | σ(x) = ln(1 + e^x)                          | 0.861    | 0.447
identity            | σ(x) = x                                    | 0.860    | 0.454


Table 1: Performance of our model with different choices of nonlinearity σ(·). Except for the identity (no nonlinearity), which performs
worse, all nonlinearities lead to a similar performance, implying that our model is robust to the choice of nonlinearity.


4.3.2 Insights: Individual video matrices 

Figure 6(b) visualizes R^v for separate videos v. Compared with
Figure 6(a), it is easier to compare segments within videos, but the
relative magnitudes of prerequisites between videos are lost. For
Video 4, we see that prerequisites within the video tend to become 
weaker as the video progresses, which is not obvious in Figure 6(a). 
For example, while Segment 23 has a heavy dependence on Seg- 
ment 22, Segment 34 is only lightly dependent on a few segments 
in the video. Being close to the end, Segment 34 is summarizing 
information across the course, which is evident through its prereq- 
uisites in Figure 6(a). The inferred relation between Segments 23 
and 24 is consistent with both of these segments’ transcripts dis- 
cussing particular technologies in the new product. 


Another insight is that with the exception of Video 3, the last seg- 
ment in each video has only light prerequisites within the video. 
Intuitively, we would expect last segments to summarize the ma- 
terial covered in the video, but such a review may not constitute 
a strong prerequisite. The transcript of Video 3’s concluding Seg- 
ment 21, on the other hand, indicates that it is a continuation of the 
“product features” discussion. 


4.3.3 Questionnaire and response 

The questionnaire provided to the course administrator began with
a brief description of the algorithm and purpose. It then included a
visualization of the R matrix, and an enumeration of several statements
drawn from our insights ranging from conclusions on particular
segments to general trends across multiple segments. A sample
statement provided is “this segment does not have any prerequisites,
i.e., studying prior segments is not helpful to its understanding.”
The task of the course administrator was to indicate their level
of agreement with each statement on a five-point Likert scale, from
1 (strong disagreement) to 5 (strong agreement).


80% of the responses we obtained to the statements were in the
range of 4-5. This indicates that the course administrator generally
agreed with the prerequisite dependencies extracted by our algorithm,
and in turn gives additional validity to our proposed model
in terms of its ability to generate human-interpretable insights.


Figure 6: Visualizations of the prerequisite matrices extracted in two ways: (a) R across the entire course, and (b) R^v for each video
v separately. The (s,s′)th entry (the entry on the sth row and s′th column, with s < s′) characterizes how much segment s serves as a
prerequisite of segment s′. The solid lines delineate the four different videos.


The disagreements tended to be for statements that compared the
magnitude to which two particular segments were prerequisites to
another segment, i.e., claiming that one was a stronger prerequisite
to the segment than the other. Since the agreements, by contrast,
were on more general statements concerning the existence and/or
strength of prerequisites to a given segment or group of segments
(e.g., “segment 1 is a strong prerequisite to segment 2”, “segments
in part 1 of the course tend to have more dependencies than segments
in part 3”), our algorithm may not differentiate magnitudes
of prerequisites for a particular segment well. There are several
possible reasons for this. One is the method used to segment the
content: rather than choosing uniform 20-second chunks of video,
for example, it may be desirable to incorporate segmentation into
the modeling procedure, e.g., by maximizing the difference in
prerequisites between adjacent segments. Another is the treatment and
presentation of the values comprising the R matrix: rather than
reporting these as real numbers, it may be desirable to group them
into relative magnitudes, e.g., low/medium/high or a simple binary
indicator of whether there is a noteworthy dependency. Educators
may be more interested in broader distinctions.
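

As a rough illustration of the grouping suggested above, the entries of R could be mapped to three levels using two cut points. The function names, the cut values, and the toy matrix below are all hypothetical; how to choose the cut points (e.g., fixed values versus quantiles of R) is left open.

```python
def bucket(value, low_cut, high_cut):
    """Map one real-valued prerequisite strength to a coarse label."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


def discretize(R, low_cut, high_cut):
    """Label every entry (s, s') with s < s'; other entries are set to None.

    R is a square list of lists in which R[s][t] is the strength of segment s
    as a prerequisite of segment t.
    """
    n = len(R)
    return [
        [bucket(R[s][t], low_cut, high_cut) if s < t else None for t in range(n)]
        for s in range(n)
    ]


# Toy 3-segment matrix (made-up numbers, not taken from the course).
R_toy = [
    [0.0, 0.05, 0.8],
    [0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0],
]
levels = discretize(R_toy, low_cut=0.1, high_cut=0.5)
```

A binary indicator is the special case in which the two cut points coincide, so that no entry is labeled "medium".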


5. CONCLUSIONS AND FUTURE WORK 


In this paper, we have proposed a recurrent neural network-based 
model to extract prerequisite structure among fine-granular pieces 
of learning content. We modeled such prerequisite structure infor- 
mation as latent variables, and extracted it from learner behavioral 
data. We applied our model to an online course dataset that con- 
tains the clickstream activity behavioral data from 12,000 learners 
watching course videos. Our experiments showed that our model 
significantly outperforms baseline models in predicting learner be- 
havior and, more importantly, that it effectively extracts both intra- 
and inter-video prerequisite dependencies among video segments; 
we were able to verify these insights through responses to a ques- 
tionnaire provided to a course administrator. More generally, our 
work demonstrated that large-scale learner behavioral data can of- 
fer interesting insight into learning content; therefore, it is impor- 
tant to use learner behavioral data to aid content analytics, espe- 
cially when content data is sparse and learner performance data is 
unavailable. 


There are several avenues of future work. One is experimentally
testing whether the extracted prerequisite structure can lead
to better selection of personalized remediation or enrichment
activities [5, 19, 27]. Another is adapting our model to other content
types, e.g., educational games [23]. Also, one can try to adapt our
model to extract prerequisite structures in longer (e.g., semester-long)
courses by aggregating learner behavior at a higher granularity
level, and compare the results against those obtained via traditional,
content data-based methods. Moreover, to further improve
the insights provided by our model, two approaches can be investigated
as discussed: incorporating segmentation into the model itself
to, e.g., maximize the difference in prerequisites between adjacent
segments, and grouping the values in the R matrix into discrete
categories. Finally, additional slack variables can be incorporated
into our model to allow variation in learner behaviors; learners
sometimes make poor assessments about their prerequisite knowledge
and are unable to navigate across the course efficiently.


In particular, for personalization, note that the prerequisite structures
(the R matrix) our model extracts can drive automated content
individualization. For example, when learner u reaches segment s′
at time t, a course delivery system could check whether the learner's
prerequisite knowledge gap at that time is positive. If not, then a
combination of segments s for which R_{s,s′} is high and the learner's
engagement with s is low (i.e., significant prerequisites that the
learner has not studied) can be displayed first. The system could then
update the gap as these prerequisites are studied, and “unlock” the
segment s′ once the learner has engaged with them enough (when the
prerequisite knowledge gap diminishes). We are currently implementing
such a method.
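

The selection step described above can be sketched as follows. This is a simplified illustration and not the method under implementation: it replaces the model's prerequisite knowledge gap with a threshold test on R and on engagement, and the function names, thresholds, and toy numbers are all hypothetical.

```python
def missing_prerequisites(R, engagement, target, r_min, e_min, max_items=3):
    """Return earlier segments that are strong prerequisites of `target`
    but that the learner has engaged with only lightly.

    R[s][t] is the strength of segment s as a prerequisite of segment t, and
    engagement[s] is the learner's engagement with segment s. The result is
    ordered from the strongest prerequisite to the weakest.
    """
    candidates = [
        (R[s][target], s)
        for s in range(target)
        if R[s][target] >= r_min and engagement[s] < e_min
    ]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [s for _, s in candidates[:max_items]]


def is_unlocked(R, engagement, target, r_min, e_min):
    """Treat `target` as unlocked once no strong prerequisite is missing."""
    return not missing_prerequisites(R, engagement, target, r_min, e_min)


# Toy example with 4 segments (made-up numbers); the learner reaches segment 3.
R_toy = [
    [0.0, 0.2, 0.1, 0.9],
    [0.0, 0.0, 0.4, 0.1],
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0],
]
engagement = [0.2, 0.0, 0.9, 0.0]
to_show = missing_prerequisites(R_toy, engagement, target=3, r_min=0.5, e_min=0.5)

engagement[0] = 1.0  # the learner studies the recommended segment
unlocked = is_unlocked(R_toy, engagement, target=3, r_min=0.5, e_min=0.5)
```

In this toy case only Segment 0 is displayed first: Segment 2 is also a strong prerequisite, but the learner has already engaged with it, and Segment 1 is only a weak prerequisite.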